What is claimed is: 



1. A method of detecting a process failure in a distributed system, the method comprising steps of:
(1) measuring a first period of time between an instance a last heartbeat was received from a first process and a later instance in time;
(2) measuring a second period of time between an instance a last heartbeat was received from a second process and said later instance in time;
(3) comparing said first and second periods of time with a predetermined threshold; and
(4) determining whether a process failure occurred in response to said comparison in step (3).

2. The method of claim 1, wherein step (3) further comprises steps of:
calculating a difference between said first period of time and said second period of time; and
comparing said difference to said predetermined threshold.

3. The method of claim 2, wherein step (4) further comprises steps of:
detecting a failure of said second process in response to said difference exceeding said predetermined threshold.
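
By way of illustration only, and not as part of the claims, the steps of claims 1 through 3 might be sketched as follows. The function name, the use of numeric timestamps in seconds, and the choice to subtract the first period from the second (rather than the reverse) are assumptions made for this sketch; the claims do not specify them.

```python
def detect_process_failure(last_hb_first, last_hb_second, now, threshold):
    """Return True if the second process is judged to have failed.

    All arguments are hypothetical: timestamps and threshold are plain
    numbers in the same unit (for example, seconds).
    """
    # Step (1): time elapsed since the last heartbeat from the first process.
    first_period = now - last_hb_first
    # Step (2): time elapsed since the last heartbeat from the second process.
    second_period = now - last_hb_second
    # Claim 2: difference between the two periods (sign convention assumed).
    difference = second_period - first_period
    # Claim 3: failure is detected when the difference exceeds the threshold.
    return difference > threshold
```

Because both periods are measured against the same later instance in time, the difference reduces to how much longer the second process has been silent than the first.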

4. The method of claim 1, wherein said steps are performed as computer-executable instructions on a computer-readable medium.

5. The method of claim 1, wherein said distributed system includes one network.

6. A method of detecting a network failure in a distributed system, the method comprising steps of:
(1) determining whether a heartbeat is received from at least one process in the distributed system prior to an expiration of a heartbeat timeout; and
(2) detecting a failure of a network in said system in response to not receiving said heartbeat from said at least one process prior to said expiration of said heartbeat timeout.
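
By way of illustration only, and not as part of the claims, one possible reading of claim 6 is that a network failure is inferred when no monitored process has delivered a heartbeat within the heartbeat timeout. The sketch below assumes that reading; the function name and the dictionary of last-heartbeat timestamps are hypothetical.

```python
def detect_network_failure(last_heartbeats, now, heartbeat_timeout):
    """Return True if no process delivered a heartbeat within the timeout.

    last_heartbeats is a hypothetical mapping from a process identifier to
    the timestamp of the last heartbeat received from that process.
    """
    # Step (1): check whether at least one heartbeat arrived before the
    # timeout expired, i.e. its age does not exceed the timeout.
    any_received = any(
        now - received_at <= heartbeat_timeout
        for received_at in last_heartbeats.values()
    )
    # Step (2): with no timely heartbeat at all, report a network failure.
    # An empty mapping is also treated as a failure in this sketch.
    return not any_received
```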

7. The method of claim 6, wherein said steps are performed as computer-executable instructions on a computer-readable medium.

8. The method of claim 6, wherein said distributed system includes one network.

9. A distributed system including a plurality of hosts connected via a network, wherein each host executes a process in said distributed system, said system comprising:
a first host of said plurality of hosts executing a first process; wherein said first host is operable to detect one of failure of a second process executing on a second host and failure of said network based on a period of time to receive a heartbeat transmitted from at least one of said plurality of hosts.

10. The system of claim 9, further comprising:
a third host of said plurality of hosts executing a third process; wherein said first host is operable to measure a first period of time between an instance a last heartbeat was received from said third host on said network and a later instance in time and measure a second period of time between an instance a last heartbeat was received from said second host and said later instance in time;
said first host being further operable to compare said first and second periods of time with a predetermined threshold, and detect a failure of said second process in response to said comparison.

11. The system of claim 10, wherein said first host is further operable to calculate a difference between said first period of time and said second period of time, and compare said difference to said predetermined threshold.

12. The system of claim 11, wherein said first host is operable to detect said failure of said second process in response to said difference exceeding said predetermined threshold.



13. The system of claim 12, wherein said first process is operable to remove said second process from a view in response to detecting said failure of said second process.
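
By way of illustration only, and not as part of the claims, the behavior recited in claims 10 through 13 might be sketched as below. The representation of the view as a set of host identifiers, the function name, and the sign convention of the difference are assumptions made for this sketch.

```python
def update_view(view, last_heartbeat, third_host, second_host, now, threshold):
    """Return the view, with second_host removed if it is judged failed.

    view is a hypothetical set of host identifiers; last_heartbeat maps a
    host identifier to the timestamp of its last received heartbeat.
    """
    # Claim 10: periods since the last heartbeat from the third and second hosts.
    first_period = now - last_heartbeat[third_host]
    second_period = now - last_heartbeat[second_host]
    # Claims 11 and 12: compare the difference with the threshold.
    if second_period - first_period > threshold:
        # Claim 13: remove the failed process from the view.
        return view - {second_host}
    return view
```

Returning a new set rather than mutating the argument is a design choice of the sketch, so the caller can compare the old and new views.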

14. The system of claim 9, wherein said first host is operable to determine whether a heartbeat is received from at least one other host in said system prior to an expiration of a heartbeat timeout.

15. The system of claim 14, wherein said first host is further operable to detect said failure of said network in response to not receiving a heartbeat from said at least one other host prior to said expiration of said heartbeat timeout.